{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Selecting dimensionality reduction with Pipeline and GridSearchCV\n\n\nThis example constructs a pipeline that does dimensionality\nreduction followed by prediction with a support vector\nclassifier. It demonstrates the use of ``GridSearchCV`` and\n``Pipeline`` to optimize over different classes of estimators in a\nsingle CV run -- unsupervised ``PCA`` and ``NMF`` dimensionality\nreductions are compared to univariate feature selection during\nthe grid search.\n\nAdditionally, ``Pipeline`` can be instantiated with the ``memory``\nargument to memoize the transformers within the pipeline, avoiding to fit\nagain the same transformers over and over.\n\nNote that the use of ``memory`` to enable caching becomes interesting when the\nfitting of a transformer is costly.\n\n###############################################################################\nIllustration of ``Pipeline`` and ``GridSearchCV``\n###############################################################################\n\nThis section illustrates the use of a ``Pipeline`` with ``GridSearchCV``\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: Robert McGibbon, Joel Nothman, Guillaume Lemaitre\n\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.svm import LinearSVC\nfrom sklearn.decomposition import PCA, NMF\nfrom sklearn.feature_selection import SelectKBest, chi2\n\nprint(__doc__)\n\npipe = Pipeline([\n    # the reduce_dim stage is populated by the param_grid\n    ('reduce_dim', 'passthrough'),\n    ('classify', LinearSVC(dual=False, max_iter=10000))\n])\n\nN_FEATURES_OPTIONS = [2, 4, 8]\nC_OPTIONS = [1, 10, 100, 1000]\nparam_grid = [\n    {\n        'reduce_dim': [PCA(iterated_power=7), NMF()],\n        'reduce_dim__n_components': N_FEATURES_OPTIONS,\n        'classify__C': C_OPTIONS\n    },\n    {\n        'reduce_dim': [SelectKBest(chi2)],\n        'reduce_dim__k': N_FEATURES_OPTIONS,\n        'classify__C': C_OPTIONS\n    },\n]\nreducer_labels = ['PCA', 'NMF', 'KBest(chi2)']\n\ngrid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)\nX, y = load_digits(return_X_y=True)\ngrid.fit(X, y)\n\nmean_scores = np.array(grid.cv_results_['mean_test_score'])\n# scores are in the order of param_grid iteration, which is alphabetical\nmean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))\n# select score for best C\nmean_scores = mean_scores.max(axis=0)\nbar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) *\n               (len(reducer_labels) + 1) + .5)\n\nplt.figure()\nCOLORS = 'bgrcmyk'\nfor i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):\n    plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i])\n\nplt.title(\"Comparing feature reduction techniques\")\nplt.xlabel('Reduced number of features')\nplt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS)\nplt.ylabel('Digit classification accuracy')\nplt.ylim((0, 1))\nplt.legend(loc='upper left')\n\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Caching transformers within a ``Pipeline``\n##############################################################################\n It is sometimes worthwhile storing the state of a specific transformer\n since it could be used again. Using a pipeline in ``GridSearchCV`` triggers\n such situations. Therefore, we use the argument ``memory`` to enable caching.\n\n .. warning::\n     Note that this example is, however, only an illustration since for this\n     specific case fitting PCA is not necessarily slower than loading the\n     cache. Hence, use the ``memory`` constructor parameter when the fitting\n     of a transformer is costly.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from joblib import Memory\nfrom shutil import rmtree\n\n# Create a temporary folder to store the transformers of the pipeline\nlocation = 'cachedir'\nmemory = Memory(location=location, verbose=10)\ncached_pipe = Pipeline([('reduce_dim', PCA()),\n                        ('classify', LinearSVC(dual=False, max_iter=10000))],\n                       memory=memory)\n\n# This time, a cached pipeline will be used within the grid search\n\n\n# Delete the temporary cache before exiting\nmemory.clear(warn=False)\nrmtree(location)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The ``PCA`` fitting is only computed at the evaluation of the first\nconfiguration of the ``C`` parameter of the ``LinearSVC`` classifier. The\nother configurations of ``C`` will trigger the loading of the cached ``PCA``\nestimator data, leading to save processing time. Therefore, the use of\ncaching the pipeline using ``memory`` is highly beneficial when fitting\na transformer is costly.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}